Delft University of Technology

Knowledge sharing in manufacturing using LLM-powered tools: user study and model benchmarking

Kernan Freire, Samuel; Wang, Chaofan; Foosherian, Mina; Wellsandt, Stefan; Ruiz-Arenas, Santiago; Niforatos, Evangelos

DOI: 10.3389/frai.2024.1293084
Publication date: 2024
Document version: Final published version
Published in: Frontiers in Artificial Intelligence

Citation (APA): Kernan Freire, S., Wang, C., Foosherian, M., Wellsandt, S., Ruiz-Arenas, S., & Niforatos, E. (2024). Knowledge sharing in manufacturing using LLM-powered tools: User study and model benchmarking. Frontiers in Artificial Intelligence, 7, Article 1293084. https://doi.org/10.3389/frai.2024.1293084

Important note: To cite this publication, please use the final published version (if applicable). Please check the document version above.

Copyright: Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.

Takedown policy: Please contact us and provide details if you believe this document breaches copyrights. We will remove access to the work immediately and investigate your claim.

This work is downloaded from Delft University of Technology. For technical reasons the number of authors shown on this cover page is limited to a maximum of 10.
TYPE: Brief Research Report
PUBLISHED: 27 March 2024
DOI: 10.3389/frai.2024.1293084

OPEN ACCESS

EDITED BY: David Romero, Monterrey Institute of Technology and Higher Education (ITESM), Mexico
REVIEWED BY: Elena Bellodi, University of Ferrara, Italy; Anisa Rula, University of Brescia, Italy
*CORRESPONDENCE: Samuel Kernan Freire, s.kernanfreire@tudelft.nl

RECEIVED: 12 September 2023
ACCEPTED: 14 March 2024
PUBLISHED: 27 March 2024

CITATION: Kernan Freire S, Wang C, Foosherian M, Wellsandt S, Ruiz-Arenas S and Niforatos E (2024) Knowledge sharing in manufacturing using LLM-powered tools: user study and model benchmarking. Front. Artif. Intell. 7:1293084. doi: 10.3389/frai.2024.1293084

COPYRIGHT: © 2024 Kernan Freire, Wang, Foosherian, Wellsandt, Ruiz-Arenas and Niforatos. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice.
No use, distribution or reproduction is permitted which does not comply with these terms.

Knowledge sharing in manufacturing using LLM-powered tools: user study and model benchmarking

Samuel Kernan Freire1*, Chaofan Wang1, Mina Foosherian2, Stefan Wellsandt2, Santiago Ruiz-Arenas3 and Evangelos Niforatos1

1 Faculty of Industrial Design Engineering, Delft University of Technology, Delft, Netherlands; 2 BIBA—Bremer Institut für Produktion und Logistik GmbH, Bremen, Germany; 3 Grupo de Investigación en Ingeniería de Diseño (GRID), Universidad EAFIT - Escuela de Administración, Finanzas e Instituto Tecnológico, Medellin, Colombia

Recent advances in natural language processing enable more intelligent ways to support knowledge sharing in factories. In manufacturing, operating production lines has become increasingly knowledge-intensive, putting strain on a factory's capacity to train and support new operators. This paper introduces a Large Language Model (LLM)-based system designed to retrieve information from the extensive knowledge contained in factory documentation and knowledge shared by expert operators. The system aims to efficiently answer queries from operators and facilitate the sharing of new knowledge. We conducted a user study at a factory to assess its potential impact and adoption, eliciting several perceived benefits, namely, enabling quicker information retrieval and more efficient resolution of issues. However, the study also highlighted a preference for learning from a human expert when such an option is available. Furthermore, we benchmarked several commercial and open-sourced LLMs for this system.
The current state-of-the-art model, GPT-4, consistently outperformed its counterparts, with open-source models trailing closely, presenting an attractive option given their data privacy and customization benefits. In summary, this work offers preliminary insights and a system design for factories considering using LLM tools for knowledge management.

KEYWORDS: natural language interface, benchmarking, Large Language Models, factory, industrial settings, industry 5.0, knowledge sharing, information retrieval

1 Introduction

Human-centric manufacturing seeks to harmonize the strengths of humans and machines, aiming to enhance creativity, human wellbeing, problem-solving abilities, and overall productivity within factories (May et al., 2015; Fantini et al., 2020; Alves et al., 2023). Despite these advancements, a significant challenge persists in effectively managing and utilizing the vast knowledge generated within these manufacturing environments, such as issue reports and machine documentation (Gröger et al., 2014). This knowledge is crucial for optimizing operations, yet it remains largely untapped due to the difficulties in processing and interpreting the disconnected, sometimes unstructured, technical information it contains (Edwards et al., 2008; Leoni et al., 2022).

Traditionally, leveraging this knowledge has been cumbersome, with operators choosing to use personal smartphones over official procedures (Richter et al., 2019) and AI unable to handle the complexity of the data (Edwards et al., 2008). However, recent Large Language Models (LLMs) like GPT-4 show promise in addressing these challenges.
LLMs can effectively interpret, summarize, and retrieve information from vast text-based datasets (Lewis et al., 2020) while concurrently aiding the capture of new knowledge (Kernan Freire et al., 2023b). These capabilities could significantly support operators in knowledge-intensive tasks, making it easier to access relevant information, share new knowledge, and make informed decisions rapidly.

While LLMs offer promising capabilities, their application in manufacturing is not straightforward. The specific, dynamic knowledge required in this domain poses unique challenges (Feng et al., 2017). For instance, a foundational LLM may have limited utility in a factory setting without significant customization, such as fine-tuning or incorporating specific context information into its prompts (Wang Z. et al., 2023). Additionally, the practical and socio-technical risks and challenges of deploying LLMs in such environments remain largely unexplored; these factors are key to human-centered AI (Shneiderman, 2022). Concerns include the accuracy of the information provided, the potential for "hallucinated" answers (Zuccon et al., 2023), and the need for systems that can adapt to the highly specialized and evolving knowledge base of a specific manufacturing setting (Feng et al., 2017).

In response to these challenges, we developed an LLM-powered tool that leverages factory documents and issue analysis reports to answer operators' queries. Furthermore, the tool facilitates the analysis and reporting of new issues. This tool demonstrates the feasibility of using LLMs to enhance knowledge management in manufacturing settings. To understand its effectiveness and potential, we conducted a user study in a factory environment, evaluating the system's usability, user perceptions, adoption, and impact on factory operations.

Our approach also addresses the lack of specific benchmarks for evaluating LLMs in manufacturing.
We benchmarked several LLMs, including both closed and open-source models, recognizing that the standard benchmarks¹ primarily focus on general knowledge and reasoning. As such, they may not adequately reflect the challenges of understanding manufacturing-specific terminology and concepts. This benchmarking focused on their ability to utilize factory-specific documents and unstructured issue reports to provide factual and complete answers to operators' queries.

2 Background

In this section, we address the topic of Industry 5.0, LLM-powered tools for knowledge management, benchmarking LLMs, and the research questions informing this work.

1 https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard (accessed February 26, 2024).

2.1 Human-centered manufacturing

Industry 5.0, the latest phase of industrial development, places human beings at the forefront of manufacturing processes, emphasizing their skills, creativity, and problem-solving abilities (Xu et al., 2021; Maddikunta et al., 2022; Alves et al., 2023). Human-centered manufacturing in Industry 5.0 focuses on providing a work environment that nurtures individuals' creativity and problem-solving capabilities (Maddikunta et al., 2022). It encourages workers to think critically, innovate, and continuously learn. With machines handling repetitive and mundane tasks, human workers can dedicate their time and energy to more complex and intellectually stimulating activities. This shift could enhance job satisfaction and promote personal and professional growth, as workers could acquire new skills and engage in higher-level decision-making (Xu et al., 2021; Alves et al., 2023). Emphasis on human-machine collaboration and the continuous emergence and refinement of technology increases the need for adequate human-computer interaction (Brückner et al., 2023).
One of the approaches to address this topic is using conversational AI to assist humans in manufacturing (Wellsandt et al., 2021).

2.2 LLM-powered knowledge management tools

Training Large Language Models (LLMs) on numerous, diverse texts results in the embedding of extensive knowledge (Zhao et al., 2023). LLMs can also adeptly interpret complex information (Jawahar et al., 2019), perform general reasoning (Wei et al., 2022a), and aid knowledge-intensive decision-making. Consequently, researchers have been exploring the application of LLM-powered tools in domain-specific tasks (Wen et al., 2023; Xie T. et al., 2023; Zhang W. et al., 2023).

Despite their potential benefits, the responses generated by LLMs may have two potential issues: (1) outdated information originating from the model's training date, and (2) inaccuracies in factual representation, known as "hallucinations" (Bang et al., 2023; Zhao et al., 2023). To address these challenges and leverage the capabilities of LLMs in domain-specific knowledge-intensive tasks, several techniques can be used, such as chain-of-thought (Wei et al., 2022b), few-shot prompting (Brown et al., 2020; Gao et al., 2021), and retrieval augmented generation (Lewis et al., 2020).

Using few-shot prompting to retrieve information across diverse topics, Semnani et al. (2023) introduced an open-domain LLM-powered chatbot called WikiChat. WikiChat utilizes a 7-stage pipeline of few-shot prompted LLMs that suggests facts verified against Wikipedia, retrieves additional up-to-date information, and generates coherent responses. They used a hybrid human-and-LLM method to evaluate the chatbot on different topics for factuality, alignment with real-world truths and verifiable facts, and conversationality. This compound metric scores how informational, natural, non-repetitive, and temporally correct the response is. Their solution significantly outperforms GPT-3.5 in factuality, with an average improvement of 24.4%, while staying on par in conversationality.
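To illustrate the few-shot prompting technique mentioned above, the sketch below assembles a prompt from demonstration question/answer pairs followed by the new query. The factory Q&A pairs are invented for illustration and are not from this study's data or from WikiChat.

```python
# Illustrative sketch of few-shot prompting (not the WikiChat pipeline):
# a handful of worked question/answer pairs are prepended to the user's
# question so the model can imitate the demonstrated style and format.
# The example pairs below are hypothetical.

FEW_SHOT_EXAMPLES = [
    ("What does error E12 on the filler mean?",
     "E12 indicates a blocked nozzle; stop the line and clean nozzle 3."),
    ("How often should the conveyor belt be inspected?",
     "Inspect the belt visually at the start of every shift."),
]

def build_few_shot_prompt(question: str) -> str:
    """Assemble a few-shot prompt from demonstration pairs plus the new query."""
    parts = ["Answer factory operators' questions concisely.\n"]
    for q, a in FEW_SHOT_EXAMPLES:
        parts.append(f"Q: {q}\nA: {a}\n")
    parts.append(f"Q: {question}\nA:")
    return "\n".join(parts)

prompt = build_few_shot_prompt("What should I do when the mixer overheats?")
print(prompt)
```

The trailing "A:" cues the model to continue in the demonstrated answer format.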
Others have explored the capabilities of LLMs in domain-specific tasks such as extracting Frontiersin ArtificialIntelligence /zero.tnum/two.tnum frontiersin.org Kernan Freireet al. /one.tnum/zero.tnum./three.tnum/three.tnum/eight.tnum/nine.tnum/frai./two.tnum/zero.tnum/two.tnum/four.tnum./one.tnum/two.tnum/nine.tnum/three.tnum/zero.tnum/eight.tnum/four.tnum structured data from unstructured healthcare texts ( Tang et al., 2023), providing medical advice ( Nov et al., 2023 ), simplifying radiologyreports( Jeblicketal.,2023 ),LegalJudgementPrediction from multilingual legal documents ( Trautmann et al., 2022 ), and scientificwriting( AlkaissiandMcFarlane,2023 ). Several manufacturers are cautiously adopting LLMs, while seeking solutions to mitigate their associated risks. For example,/two.tnumused AI with ChatGPT integrated through Azure OpenAI Service to enhance quality management and process optimization in vehicle production. This AI-driven approach simplifies complex evaluations for quality engineers through dialogue-based queries. Xia et al. (2023) demonstrated how using in-context learning and injecting task-specific knowledge into an LLM can streamline intelligent planning and control of production processes. Kernan Freire et al. (2023a) built a proof of concept for bridging knowledge gaps among workers by utilizing domain-specific texts and knowledge graphs. Wang X. et al. (2023) conducted a systematic test of ChatGPT’s responses to 100 questions from course materials and industrial documents. They used a zero-shot method and examined the responses’ correctness, relevance, clarity, and comparability. Their results suggested areas for improvement, including low scores when respondingtocriticalanalysis questions,occasionalnon-factualor out-of-manufacturing scope responses, and dependency on query quality. Although Wang X. et al. 
(2023) provides a comprehensive review of ChatGPT's abilities to answer questions related to manufacturing, it did not include the injection of task-specific knowledge into the prompts.

To improve the performance of an LLM for domain-specific tasks, relevant context information can be automatically injected along with a question prompt. This technique, known as Retrieval Augmented Generation (RAG), involves searching a corpus for information relevant to the user's query and inserting it into a query template before sending it to the LLM (Lewis et al., 2020). Using RAG also enables further transparency and explainability of the LLM's response: users can check the referenced documents to verify the LLM's response. Factories will likely have a large corpus of knowledge available in natural language, such as standard work instructions or machine manuals. Furthermore, factory workers continually add to the pool of available knowledge through (issue) reports. Until recently, these reports were considered unusable by AI natural language processing due to quality issues such as poorly structured text, inconsistent terminology, or incompleteness (Edwards et al., 2008; Müller et al., 2021). However, the leap in natural language understanding that LLMs, such as ChatGPT, have brought about can overcome these issues.

2.3 Evaluating LLMs

Large Language Model evaluation requires the definition of evaluation criteria, metrics, and datasets associated with the system's main tasks. There are two types of LLM evaluations: intrinsic and extrinsic evaluation. Intrinsic evaluation focuses on the internal properties of a language model (Wei et al., 2023), that is, the patterns and language structures learned during the pre-training phase.

2 https://group.mercedes-benz.com/innovation/digitalisation/industry-4-0/chatgpt-in-vehicle-production.html (accessed February 26, 2024).
Extrinsic evaluation focuses on the model's performance in downstream tasks, i.e., in the execution of specific tasks that make use of the linguistic knowledge gained upstream, like code completion (Xu et al., 2022). Although extrinsic evaluation is computationally expensive, conducting only intrinsic evaluation is not comprehensive, as it only tests the LLM's capability for memorization (Jang et al., 2022). Here, we focus on extrinsic evaluation as we are primarily interested in the performance of LLM-based tools for specific real-world tasks.

Extrinsic evaluation implies assessing the system's performance in tasks such as question answering, translation, reading comprehension, and text classification, among others (Kwon and Mihindukulasooriya, 2022). Existing benchmarks such as LAMBADA, HellaSwag, TriviaQA, BLOOM, Galactica, ClariQ and MMLU, among others, are widely reported in the literature for comparing language models. Likewise, domain-specific benchmarks for tasks such as medical question answering (Singhal et al., 2023), fairness evaluation (Zhang J. et al., 2023), finance (Xie Q. et al., 2023), robot policies (Liang et al., 2022), and 3D printing code generation (Badini et al., 2023) can also be found. Experts also evaluate the performance of large language models (LLMs) in specific downstream tasks, for example, using physicians to evaluate the output of medical-specific LLMs (Singhal et al., 2023).

LLM benchmarks range from specific downstream tasks to general language tasks. However, to our knowledge, LLMs have not been benchmarked for answering questions in the manufacturing domain based on context material, a technique known as Retrieval Augmented Generation (Lewis et al., 2020). Material such as machine documentation, standard work instructions, or issue reports will contain domain jargon and technical information that LLMs may struggle to process.
Furthermore, the text in an issue report may pose additional challenges due to abbreviations, poor grammar, and formatting (Edwards et al., 2008; Oruç, 2020; Müller et al., 2021). Therefore, as part of this work, we benchmarked several LLMs on their ability to answer questions based on factory manuals and unstructured issue reports. Furthermore, we conducted a user study with factory operators and managers to assess the potential benefits, risks, and challenges. The following research questions informed our study:

1. What are the perceived benefits, challenges, and risks of using Large Language Models for information retrieval and knowledge sharing for factory operators?
2. How do Large Language Models compare in performance when answering factory operators' queries based on factory documentation and unstructured issue reports? We consider performance as the factuality, completeness, hallucinations, and conciseness of the generated response.

3 System

We built a fully functional system to assess the potential of using LLMs for information retrieval and knowledge sharing for factory operators. Benefiting from LLMs' in-context learning capabilities, we supply an LLM with information in the form of factory manuals and issue reports relevant to the user's question, a technique known as Retrieval Augmented Generation (RAG) (Lewis et al., 2020), see Figure 1.

FIGURE 1: The steps of Retrieval Augmented Generation (RAG) from user query to response.

As noted by Wei et al. (2022a), supplying LLMs with a prompt packed with query-related information can yield substantial performance enhancements. Users can ask questions in the chat box by typing or using voice input.
The response is displayed at the top of the page, and the document chunks used for the answer can be checked at the bottom (see Figure 2).

3.1 Tool dependencies

The tool was constructed utilizing two technologies: Gradio and LlamaIndex. Gradio, a tool developed by Abid et al. (2019), serves as the backbone for both our front and back ends. Primarily used to simplify the development and distribution of machine learning applications, Gradio allows the quick creation of intuitive, user-friendly web interfaces for machine learning models. Additionally, we use LlamaIndex, created by Liu (2022), for retrieving the context material in response to user queries and handling the queries to the LLM. LlamaIndex, initially known as GPT Index, is a data framework designed for the efficient handling and accessibility of private or domain-specific data in LLM applications.

Since the factory documents can be long, they may overflow the LLM's context window or result in unnecessary computational demand. To overcome this, we segment the materials into manageable chunks, each comprising ∼400 tokens. This method effectively incorporates the materials into the LLM prompt without compromising the conversation flow. Following the segmentation, each document chunk is processed through LlamaIndex using the OpenAI Embedding API.³ Utilizing the "text-embedding-ada-002" model, LlamaIndex transforms each chunk into a corresponding embedding vector. These resulting vectors are then securely stored, ready for future retrieval and use.

3.2 Knowledge base construction

Our experiment incorporates two distinct types of domain-specific data: factory manuals and shared knowledge from factory workers. Factory manuals outline information on machine operation, safety protocols, quality assurance, and more. These resources, provided by factory management teams, initialize the knowledge base for each specific factory. The materials come in various formats, including PDF, Word, and CSV files.
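The ingest-and-retrieve flow described above (chunk, embed, store, then retrieve the most similar chunks) can be sketched as follows. This is a minimal offline approximation: the real system uses LlamaIndex with OpenAI's "text-embedding-ada-002" model, whereas here `embed_stub` is a hash-based stand-in and whitespace tokens approximate the ∼400-token chunk size.

```python
# Minimal sketch of the ingest-and-retrieve flow described above.
# The real system uses LlamaIndex with OpenAI's "text-embedding-ada-002";
# `embed_stub` is a deterministic stand-in so the sketch runs offline,
# and whitespace tokens approximate the ~400-token chunk size.
import hashlib
import math

CHUNK_TOKENS = 400

def chunk_text(text: str, max_tokens: int = CHUNK_TOKENS) -> list[str]:
    """Split a document into chunks of at most `max_tokens` whitespace tokens."""
    tokens = text.split()
    return [" ".join(tokens[i:i + max_tokens])
            for i in range(0, len(tokens), max_tokens)]

def embed_stub(text: str, dim: int = 8) -> list[float]:
    """Toy embedding: hash-derived unit vector (replace with a real API call)."""
    digest = hashlib.sha256(text.encode()).digest()
    vec = [b / 255 for b in digest[:dim]]
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def top_k(query: str, store: list[tuple[str, list[float]]], k: int = 3):
    """Return the k chunks whose embeddings are most similar to the query."""
    q = embed_stub(query)
    scored = [(sum(a * b for a, b in zip(q, v)), chunk) for chunk, v in store]
    return [chunk for _, chunk in sorted(scored, reverse=True)[:k]]

# Build the vector store from a (toy) document, then retrieve.
doc = "word " * 1000
store = [(c, embed_stub(c)) for c in chunk_text(doc)]
print(len(store))  # 1000 tokens / 400 per chunk -> 3 chunks
```

Because the stub vectors are unit-normalized, the dot product in `top_k` acts as a cosine similarity, mirroring the similarity search LlamaIndex performs over the stored embeddings.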
In addition to the factory manuals, we integrate issue analysis reports from factory workers. This information is gathered from the production line utilizing the five-why process, an iterative root-cause analysis technique (Serrat, 2017) (right side of Figure 2). The five-why technique probes the cause-and-effect relationships underlying specific problems by repeatedly asking "Why?" until the root cause is revealed, typically by the fifth query. This process enables us to gather real-world issues encountered on production lines, which may not be covered in the factory manuals. Upon entering all required information, including one or more "whys", the operator presses "check", triggering a prompt to the LLM that performs a logical check of the entered information and checks for inconsistencies with previously reported information. The operator can revise the entered information or submit it as is. Then, the submitted report is added to a queue for expert operators to check before it is added to the knowledge base.

3.3 Query construction

To retrieve the document data relevant to specific user queries, we employ the same embedding model, "text-embedding-ada-002", to generate vector representations of these queries. By leveraging the similarity calculation algorithm provided by LlamaIndex, we can identify and retrieve the top-K most similar segmented document snippets related to the user query. This allows us to construct pertinent LLM queries. Once the snippets are retrieved, they are synthesized into the following query template based on the templates used by LlamaIndex⁴:

You are an assistant that assists detergent production line operators with decision support and advice based on a knowledge base of standard operating procedures, single point lessons (SPL), etc. We have provided context information below from relevant documents and reports.
[Retrieved Document Snippets]

Given this information, please answer the following question: [Query]

If the provided context does not include relevant information to answer the question, please do not respond.

However, considering our data originates from two distinct sources, factory manuals and shared tactical knowledge, we have decided to segregate these into two separate LLM queries. This approach is designed to prevent potential user confusion from combining data from both sources into a single query.

3 https://api.openai.com/v1/embeddings (accessed February 26, 2024).
4 https://docs.llamaindex.ai (accessed February 26, 2024).

FIGURE 2: The main screens for the tool's interface are the chat interface and issue analysis screen. The "relevant document sections" part is blurred for confidentiality as it shows the title of a company's document and its content.

4 User study in the field

We conducted a user study on the system to uncover perceived benefits, usability issues, risks, and barriers to adoption. The study comprised four tasks: (1) ask the system several questions about how to solve a specific production issue and/or perform a standard procedure, (2) complete a "yellow tag" (issue analysis report) based on a recent issue, (3) request a logical check of the completed report, and finally, (4) upload new documents to the system. After each task, participants were asked to provide feedback. Then, after completing all tasks, the participants were posed several open questions about the system's benefits, risks, and barriers to adoption.
Finally, demographic information, such as age, gender, and role, was collected.

4.1 Participants

We recruited N = 9 participants from a detergent factory, of which n = 4 were managers (P1-4) and n = 5 were operators (P5-9). Of the nine participants, n = 3 were women and n = 6 were men. Participant age was distributed over three brackets, namely n = 2 were 30-39, n = 4 were 40-49, and n = 3 were 50-59.

4.2 Qualitative analysis

An inductive thematic analysis (Guest et al., 2011) of the answers to the open questions resulted in six themes, discussed below.

• Usability: the theme of usability underlines the system's ease of use and the need for clear instructions. Users mentioned the necessity for a "user-friendly" (P2) interface and highlighted the importance of having "more instructions and more details need to be loaded" (P1) to avoid confusion. This indicates a desire for intuitive navigation that could enable workers to use the system effectively without extensive training or referencing external help. The feedback suggests that the system already works well, as reflected in statements like "Easy-to-use system" (P3) and the system "works well" (P7).

• Access to information: users appreciated the "ease of having information at hand" (P1), facilitating immediate access to necessary documents. However, there is a clear call for improvements, such as the ability to "Include the possibility of opening IO, SPL, etc. in pdf format for consultation" (P3). This theme is supported by requests for direct links to full documents, suggesting that while "the list of relevant documents from which information is taken is excellent" (P4), the ability to delve deeper into full documents would significantly enhance the user experience.
• Efficiency: users value the "greater speed in carrying out some small tasks" (P3). However, there are concerns about the system's efficiency when it does not have the answer, leading to "wasting time looking for a solution to a problem in case it is not reported in the system's history" (P3). Statements like "quick in responses" (P3) contrast with the need for questions to be "too specific to have a reliable answer" (P7), indicating tension between the desire for quick solutions and the system's limitations.

• Adoption: users highlight several factors affecting adoption of the new system. These include challenges such as "awareness and training of operators [might hinder adoption]" (P3) and the need for "acceptance by all employees" (P4), which indicates that the system's success is contingent on widespread user buy-in. The generational divide is also noted: "That older operators use it [on what may hinder adoption]" (P7) suggests that demographic factors may influence the acceptance of new technology.

• Safety: a manager expressed apprehension that "if the responses are not adequate, you risk safety" (P1), emphasizing the critical nature of reliable information in a high-risk factory setting. Beyond information being outdated or useless, the possibility of "hallucinated" responses leading to dangerous situations in a factory that processes chemicals is especially concerning.

• Traditional vs. novel: there is a noticeable preference for established practices among some users. For instance, "It's faster and easier to ask an expert colleague working near me rather than [the system]" (P8) captures the reliance on human expertise over the assistant system. This tension is further demonstrated by the sentiment that "Operators may benefit more from traditional information retrieval systems" (P9), suggesting a level of skepticism or comfort with the status quo that the new system needs to overcome.
5 LLM benchmarking

In our benchmarking experiment, we evaluated various commercial and open-source LLMs, including OpenAI's ChatGPT (GPT-3.5 and GPT-4 from July 20th, 2023), the Guanaco 65B and 33B variants (Dettmers et al., 2023) based on Meta's Llama (Large Language Model Meta AI) (Touvron et al., 2023), Mixtral 8x7B (Jiang et al., 2024), Llama 2 (Touvron et al., 2023), and one of its derivatives, StableBeluga2.⁵ This selection represents the state-of-the-art closed-source models (e.g., GPT-4) and open-source models (e.g., Llama 2). We included the (outdated) Guanaco models to demonstrate the improvements in the open-source sphere over the past year.

We used a web UI for LLMs⁶ to load and test the Mixtral 8x7B, Guanaco models, and StableBeluga2. The models were loaded on a pair of Nvidia A6000s with NVLink and a total Video Random Access Memory (VRAM) capacity of 96 GB. The 65B model was run in 8-bit mode to fit in the available VRAM. We used the llama-precise parameter preset and a fixed zero seed for reproducibility. Llama 2 was evaluated using the demo on Hugging Face.⁷

To rigorously assess the models, we prepared 20 questions of varying complexity based on two types of context material: half from operating manuals and half from unstructured issue reports. The operating manuals included excerpts from actual machine manuals and standard operating procedures, while the informal issue reports were free-text descriptions of issues we had previously collected from operators. The model prompt was constructed using the template above (3.3). Ultimately, the difficulty of a question is a combination of the question's complexity and the clarity of the source material. Simple questions involve retrieving a single piece of information clearly stated in the context material, for example, "At what temperature is relubrication necessary for the OKS 4220 grease?". Conversely, difficult questions require more reasoning or comprise multiple parts, for example, "What should I do if the central turntable is overloaded?"
which has a nuanced answer dependent on several factors not clearly articulated in the context material.

In addition to measuring response length in words, every response is manually scored on factuality, completeness, and hallucinations, as defined below:

• Factuality: the response aligns with the facts in the context material.
• Completeness: the response contains all the information relevant to the question in the context material.
• Hallucinations: the response appears grammatically and semantically coherent but is not based on the context material.

The following scoring protocol is applied: a score of one is awarded for a completely factual, complete, or hallucinated response. In contrast, a score of 0.5 is awarded for a slightly nonfactual, incomplete, or hallucinated response (e.g., the response includes four out of the five correct steps). Otherwise, a score of zero is awarded. Therefore, wrong answers are penalized heavily. If the model responds by saying it cannot answer the question and does not make any attempt to do so, it is scored zero for factuality and completeness, but no score is given for hallucination.

5 https://huggingface.co/stabilityai/StableBeluga2 (accessed February 26, 2024).
6 https://github.com/oobabooga/text-generation-webui/tree/main (accessed February 26, 2024).
7 https://huggingface.co/meta-llama/Llama-2-70b-chat-hf (accessed February 26, 2024).

FIGURE 3: Benchmark of seven LLMs for generating answers based on factory materials.
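The scoring protocol, including the normalization of the hallucination score over answered questions only, can be sketched as follows. The numeric inputs in the example are hypothetical illustrations, not the paper's data.

```python
# Sketch of the benchmark scoring described above. Each response gets
# 1, 0.5, or 0 per criterion; unanswered questions receive no
# hallucination score, so the hallucination total is normalized over
# the questions the model actually attempted. Inputs below are invented.
N_QUESTIONS = 20

def corrected_score(score_sum: float, n_unanswered: int,
                    n_questions: int = N_QUESTIONS) -> float:
    """Normalize a per-question score sum to a 0-100 scale,
    counting only the questions the model actually attempted."""
    return score_sum / (n_questions - n_unanswered) * 100

# Hypothetical example: a model attempts 16 of 20 questions and
# accumulates a hallucination score sum of 2.5 over those attempts.
print(corrected_score(2.5, n_unanswered=4))  # 2.5 / 16 * 100 = 15.625
```

With no unanswered questions the formula reduces to a plain percentage over all 20 questions, matching the "out of 100" scale reported in Table 1.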
TABLE/one.tnum Modelbenchmarkingscores(outof/one.tnum/zero.tnum/zero.tnum)andaverageresponselength. Model Factuality Completeness Hallucinations Words GPT-4 97.5 95 0 69 StableBeluga2 95 92.5 7.5 58 Mixtral8x7B 92.5 92.5 2.5 66 GPT-3.5 90 90 5 89 Llama2 77.5 82.5 13 128 Guanaco65B 55 39.5 65 131 Guanaco33b 27.5 27.5 65.6 190 such, the final score for hallucination is calculated as follows: correctedscore =score 20−numberofunansweredquestions×100 As shown in Figure3 andTable1, GPT-4 outperforms other models regarding factuality, completeness, and lack of hallucinations but is closely followed by StableBeluga2 and GPT- 3.5. The Guanaco models, based on Llama 1, perform significantly worse. The conciseness of the responses showed a similar pattern, except that StableBeluga2 produced the shortest answers (58 words),followedcloselybyMixtral8x7B(66words)andGPT-4(69 words). /six.tnum Discussion /six.tnum./one.tnum GPT-/four.tnum is the best, but open-source models follow closely GPT-4performsbestacrossallmeasuresbutiscloselyfollowed by StableBeluga2, Mixtral 8x7B, and GPT-3.5. Compared to GPT- 4, the cost per input token for GPT-3.5 is significantly lower./eight.tnum However,thehighercostsofGPT-4arepartiallycounteractedbyits /eight.tnumhttps://openai.com/pricing#language-models (accessed February /two.tnum/six.tnum, /two.tnum/zero.tnum/two.tnum/four.tnum).concise yet complete responses. If longer, more detailed responses were desired (e.g., for training purposes), the prompt could be adjusted.Weobservedthatthelesspowerfulmodels,suchasGPT- 3.5andLlama2,tendedtobewordierandincludeadditionaldetails thatwerenotdirectlyrequested.Incontrast,GPT-4,StableBeluga2, andMixtral8x7Bgeneratedmoreconciseresponses. 
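To make the scoring protocol from Section 5 concrete, the per-response rubric and the corrected hallucination score can be sketched in Python. This is a minimal illustration under our reading of the protocol; the class, function, and field names are ours, not part of the study's tooling:

```python
from dataclasses import dataclass
from typing import Optional

TOTAL_QUESTIONS = 20  # size of the benchmark question set


@dataclass
class ResponseScore:
    """Manual scores for one model response (each 1.0, 0.5, or 0.0)."""
    factuality: float
    completeness: float
    # None when the model declined to answer: no hallucination score is given.
    hallucination: Optional[float]


def corrected_hallucination_score(scores: list[ResponseScore]) -> float:
    """Corrected score = score / (20 - unanswered questions) x 100.

    Averages the hallucination score over answered questions only,
    expressed out of 100.
    """
    unanswered = sum(1 for s in scores if s.hallucination is None)
    total = sum(s.hallucination for s in scores if s.hallucination is not None)
    return total / (TOTAL_QUESTIONS - unanswered) * 100


# Hypothetical example: 17 clean answers, one partial hallucination (0.5),
# and two declined answers (scored zero for factuality/completeness).
scores = (
    [ResponseScore(1.0, 1.0, 0.0)] * 17
    + [ResponseScore(1.0, 0.5, 0.5)]
    + [ResponseScore(0.0, 0.0, None)] * 2
)
print(round(corrected_hallucination_score(scores), 2))  # prints 2.78
```

Factuality and completeness carry no such correction, since declined answers are simply scored zero on those dimensions and averaged over all 20 questions.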
The latest generation of open-source models, such as Mixtral 8x7B and Llama 2 derivatives like StableBeluga2, demonstrates a clear jump forward relative to their predecessors based on Llama 1, which were more prone to hallucinations and exhibited poorer reasoning abilities over the context material. While open-source models like StableBeluga2 and Mixtral 8x7B do not score as high as GPT-4, they ensure better data security, privacy, and customization if hosted locally. This can be a crucial consideration for companies with sensitive data or unique needs.

6.2 The tool is beneficial but inferior to human experts

Users appreciate the system's functionality and see it as a tool for modernizing factory operations and speeding up tasks. They are keen on improvements for better user experience and utility, especially in the areas of content, feature enhancements, and user training. However, they express concerns about potential safety risks and the efficacy of information retrieval compared to consulting expert personnel. While these concerns are understandable, the tool was not designed to replace human-human interactions; instead, it can be used when no human experts are present or when they do not know or remember how to solve a specific issue. This would come into play during the night shift at the factory where we conducted the user study, as a single operator runs a production line, leaving limited options for eliciting help from others.

6.3 Limitations and future work

We used the same prompt for all LLMs; however, it is possible that some of the LLMs would perform better with a prompt template developed explicitly for them.
For consistency, we matched the LLMs' hyperparameters (e.g., temperature) as closely as possible across all the tested models, except for Llama 2, for which we did not have access to the presets because we did not host it locally. Our model benchmarking procedure involved 20 questions, and a single coder assessed the responses. This introduces the potential for bias, and the limited number of questions may not cover the full spectrum of complexities in real-world scenarios. To mitigate these shortcomings, we varied query complexity and source material types.

The study's design did not include a real-world evaluation involving end users operating the production line, as this was considered too risky for our industry partner. Such an environment might present unique challenges and considerations not addressed in this research, such as time pressure. Yet, by involving operators and managers and instructing them to pose several questions based on their actual work experience, we could still evaluate the system and collect valid feedback.

These limitations suggest directions for future research, for example, longitudinal studies where operators use the tool during production line operations, and more comprehensive prompt and model customization. Longitudinal studies will be key to understanding the real-world impact on production performance, operator wellbeing, and cognitive abilities.

7 Conclusion

The results demonstrated GPT-4's superior performance over other models regarding factuality, completeness, and minimal hallucinations. Interestingly, open-source models like StableBeluga2 and Mixtral 8x7B followed close behind. The user study highlighted the system's user-friendliness, speed, and logical functionality. However, improvements in the user interface and content specificity were suggested, along with potential new features.
Benefits included modernizing factory operations and speeding up specific tasks, though concerns about safety, efficiency, and inferiority to asking human experts were raised.

Data availability statement

The raw data supporting the conclusions of this article will be made available by the authors, without undue reservation.

Ethics statement

The studies involving humans were approved by the Human Research Ethics Committee (HREC) of TU Delft. The studies were conducted in accordance with the local legislation and institutional requirements. The participants provided their written informed consent to participate in this study.

Author contributions

SK: Writing – original draft, Visualization, Software, Project administration, Methodology, Investigation, Formal analysis, Data curation, Conceptualization. CW: Writing – original draft, Software, Methodology, Conceptualization. MF: Writing – original draft. SW: Writing – original draft. SR-A: Writing – original draft. EN: Writing – review & editing, Supervision, Methodology, Conceptualization.

Funding

The author(s) declare financial support was received for the research, authorship, and/or publication of this article. This work was supported by the European Union's Horizon 2020 research and innovation program via the project COALA "COgnitive Assisted agile manufacturing for a LAbor force supported by trustworthy Artificial Intelligence" (Grant agreement 957296).

Conflict of interest

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Publisher's note

All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.
References

Abid, A., Abdalla, A., Abid, A., Khan, D., Alfozan, A., and Zou, J. (2019). Gradio: Hassle-Free Sharing and Testing of ML Models in the Wild. doi: 10.48550/arXiv.1906.02569

Alkaissi, H., and McFarlane, S. I. (2023). Artificial hallucinations in ChatGPT: implications in scientific writing. Cureus 15:e35179. doi: 10.7759/cureus.35179

Alves, J., Lima, T. M., and Gaspar, P. D. (2023). Is industry 5.0 a human-centred approach? A systematic review. Processes 11. doi: 10.3390/pr11010193

Badini, S., Regondi, S., Frontoni, E., and Pugliese, R. (2023). Assessing the capabilities of ChatGPT to improve additive manufacturing troubleshooting. Adv. Ind. Eng. Polym. Res. 6, 278–287. doi: 10.1016/j.aiepr.2023.03.003

Bang, Y., Cahyawijaya, S., Lee, N., Dai, W., Su, D., Wilie, B., et al. (2023). A multitask, multilingual, multimodal evaluation of ChatGPT on reasoning, hallucination, and interactivity. arXiv. doi: 10.18653/v1/2023.ijcnlp-main.45

Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., et al. (2020). "Language models are few-shot learners," in Advances in Neural Information Processing Systems, Vol. 33, eds H. Larochelle, M. Ranzato, R. Hadsell, M. Balcan, and H. Lin (Red Hook, NY: Curran Associates, Inc.), 1877–1901.

Brückner, A., Hein, P., Hein-Pensel, F., Mayan, J., and Wölke, M. (2023). "Human-centered HCI practices leading the path to industry 5.0: a systematic literature review," in HCI International 2023 Posters, eds C. Stephanidis, M. Antona, S. Ntoa, and G. Salvendy (Cham: Springer Nature Switzerland), 3–15.

Dettmers, T., Pagnoni, A., Holtzman, A., and Zettlemoyer, L. (2023). QLoRA: Efficient Finetuning of Quantized LLMs. doi: 10.48550/arXiv.2305.14314

Edwards, B., Zatorsky, M., and Nayak, R. (2008). Clustering and classification of maintenance logs using text data mining. Data Mining Anal. 87, 193–199.

Fantini, P., Pinzone, M., and Taisch, M. (2020). Placing the operator at the centre of industry 4.0 design: modelling and assessing human activities within cyber-physical systems. Comp. Ind. Eng. 139:105058. doi: 10.1016/j.cie.2018.01.025

Feng, S. C., Bernstein, W. Z., Thomas Hedberg, J., and Feeney, A. B. (2017). Toward knowledge management for smart manufacturing. J. Comp. Inf. Sci. Eng. 17:3. doi: 10.1115/1.4037178

Gao, T., Fisch, A., and Chen, D. (2021). "Making pre-trained language models better few-shot learners," in Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers) (Association for Computational Linguistics), 3816–3830.

Gröger, C., Schwarz, H., and Mitschang, B. (2014). "The manufacturing knowledge repository - consolidating knowledge to enable holistic process knowledge management in manufacturing," in Proceedings of the 16th International Conference on Enterprise Information Systems (SCITEPRESS - Science and Technology Publications), 39–51. doi: 10.5220/0004891200390051

Guest, G., MacQueen, K. M., and Namey, E. E. (2011). Applied Thematic Analysis. Thousand Oaks, CA: Sage Publications.

Jang, J., Ye, S., Lee, C., Yang, S., Shin, J., Han, J., et al. (2022). TemporalWiki: a lifelong benchmark for training and evaluating ever-evolving language models. arXiv. doi: 10.18653/v1/2022.emnlp-main.418

Jawahar, G., Sagot, B., and Seddah, D. (2019). "What does BERT learn about the structure of language?," in Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (Florence: Association for Computational Linguistics), 3651–3657.

Jeblick, K., Schachtner, B., Dexl, J., Mittermeier, A., Stüber, A. T., Topalis, J., et al. (2023). ChatGPT makes medicine easy to swallow: an exploratory case study on simplified radiology reports. Eur. Radiol. 1–9. doi: 10.1007/s00330-023-10213-1

Jiang, A. Q., Sablayrolles, A., Roux, A., Mensch, A., Savary, B., Bamford, C., et al. (2024). Mixtral of Experts. doi: 10.48550/arXiv.2401.04088

Kernan Freire, S., Foosherian, M., Wang, C., and Niforatos, E. (2023a). "Harnessing large language models for cognitive assistants in factories," in Proceedings of the 5th International Conference on Conversational User Interfaces, CUI '23 (New York, NY: Association for Computing Machinery). doi: 10.1145/3571884.3604313

Kernan Freire, S., Wang, C., Ruiz-Arenas, S., and Niforatos, E. (2023b). "Tacit knowledge elicitation for shop-floor workers with an intelligent assistant," in Extended Abstracts of the 2023 CHI Conference on Human Factors in Computing Systems, 1–7.

Kwon, B. C., and Mihindukulasooriya, N. (2022). "An empirical study on pseudo-log-likelihood bias measures for masked language models using paraphrased sentences," in TrustNLP 2022 - 2nd Workshop on Trustworthy Natural Language Processing, Proceedings of the Workshop (New York, NY), 74–79. doi: 10.1145/3544549.3585755

Leoni, L., Ardolino, M., El Baz, J., Gueli, G., and Bacchetti, A. (2022). The mediating role of knowledge management processes in the effective use of artificial intelligence in manufacturing firms. Int. J. Operat. Prod. Manag. 42, 411–437. doi: 10.1108/IJOPM-05-2022-0282

Lewis, P., Perez, E., Piktus, A., Petroni, F., Karpukhin, V., Goyal, N., et al. (2020). "Retrieval-augmented generation for knowledge-intensive NLP tasks," in Proceedings of the 34th International Conference on Neural Information Processing Systems, NIPS'20 (Red Hook, NY: Curran Associates Inc.).

Liang, J., Huang, W., Xia, F., Xu, P., Hausman, K., Ichter, B., et al. (2022). Code as Policies: Language Model Programs for Embodied Control. doi: 10.48550/arXiv.2209.07753

Liu, J. (2022). LlamaIndex. Available online at: https://github.com/jerryjliu/llama_index

Maddikunta, P. K. R., Pham, Q.-V., B, P., Deepa, N., Dev, K., Gadekallu, T. R., et al. (2022). Industry 5.0: a survey on enabling technologies and potential applications. J. Ind. Inf. Integr. 26:100257. doi: 10.1016/j.jii.2021.100257

May, G., Taisch, M., Bettoni, A., Maghazei, O., Matarazzo, A., and Stahl, B. (2015). A new human-centric factory model. Proc. CIRP 26, 103–108. doi: 10.1016/j.procir.2014.07.112

Müller, M., Alexandi, E., and Metternich, J. (2021). Digital shop floor management enhanced by natural language processing. Procedia CIRP 96, 21–26. doi: 10.1016/j.procir.2021.01.046

Nov, O., Singh, N., and Mann, D. (2023). Putting ChatGPT's Medical Advice to the (Turing) Test. doi: 10.48550/arXiv.2301.10035

Oruç, O. (2020). A semantic question answering through heterogeneous data source in the domain of smart factory. Int. J. Nat. Lang. Comput. 9.

Richter, S., Waizenegger, L., Steinhueser, M., and Richter, A. (2019). Knowledge management in the dark: the role of shadow IT in practices in manufacturing. IJKM 15, 1–19. doi: 10.4018/IJKM.2019040101

Semnani, S. J., Yao, V. Z., Zhang, H. C., and Lam, M. S. (2023). WikiChat: A Few-Shot LLM-Based Chatbot Grounded With Wikipedia. doi: 10.48550/arXiv.2305.14292

Serrat, O. (2017). The Five Whys Technique. Knowledge Solutions: Tools, Methods, and Approaches to Drive Organizational Performance, 307–310.

Shneiderman, B. (2022). Human-Centered AI. Oxford: Oxford University Press.

Singhal, K., Azizi, S., Tu, T., Mahdavi, S. S., Wei, J., Chung, H. W., et al. (2023). Large language models encode clinical knowledge. Nature 620, 172–180. doi: 10.1038/s41586-023-06291-2

Tang, R., Han, X., Jiang, X., and Hu, X. (2023). Does Synthetic Data Generation of LLMs Help Clinical Text Mining? doi: 10.48550/arXiv.2303.04360

Touvron, H., Martin, L., Stone, K., Albert, P., Almahairi, A., Babaei, Y., et al. (2023). Llama 2: Open Foundation and Fine-Tuned Chat Models. doi: 10.48550/arXiv.2307.09288

Trautmann, D., Petrova, A., and Schilder, F. (2022). Legal Prompt Engineering for Multilingual Legal Judgement Prediction. doi: 10.48550/arXiv.2212.02199

Wang, X., Anwer, N., Dai, Y., and Liu, A. (2023a). ChatGPT for design, manufacturing, and education. Proc. CIRP 119, 7–14. doi: 10.1016/j.procir.2023.04.001

Wang, Z., Yang, F., Zhao, P., Wang, L., Zhang, J., Garg, M., et al. (2023b). Empower large language model to perform better on industrial domain-specific question answering. arXiv. doi: 10.18653/v1/2023.emnlp-industry.29

Wei, C., Wang, Y.-C., Wang, B., and Kuo, C. C. J. (2023). An Overview on Language Models: Recent Developments and Outlook. doi: 10.1561/116.00000010

Wei, J., Tay, Y., Bommasani, R., Raffel, C., Zoph, B., Borgeaud, S., et al. (2022a). Emergent Abilities of Large Language Models. doi: 10.48550/arXiv.2303.05759

Wei, J., Wang, X., Schuurmans, D., Bosma, M., Ichter, B., Xia, F., et al. (2022b). "Chain-of-thought prompting elicits reasoning in large language models," in Advances in Neural Information Processing Systems, Vol. 35, eds S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh (Red Hook, NY: Curran Associates, Inc.), 24824–24837.

Wellsandt, S., Hribernik, K., and Thoben, K.-D. (2021). "Anatomy of a digital assistant," in Advances in Production Management Systems. Artificial Intelligence for Sustainable and Resilient Production Systems, eds A. Dolgui, A. Bernard, D. Lemoine, G. von Cieminski, and D. Romero (Cham: Springer International Publishing), 321–330.

Wen, C., Sun, X., Zhao, S., Fang, X., Chen, L., and Zou, W. (2023). ChatHome: development and evaluation of a domain-specific language model for home renovation. arXiv. doi: 10.48550/arXiv.2307.15290

Xia, Y., Shenoy, M., Jazdi, N., and Weyrich, M. (2023). Towards autonomous system: flexible modular production system enhanced with large language model agents. arXiv. doi: 10.1109/ETFA54631.2023.10275362

Xie, Q., Han, W., Zhang, X., Lai, Y., Peng, M., Lopez-Lira, A., et al. (2023a). PIXIU: a large language model, instruction data and evaluation benchmark for finance. arXiv. doi: 10.48550/arXiv.2306.05443

Xie, T., Wan, Y., Huang, W., Yin, Z., Liu, Y., Wang, S., et al. (2023b). DARWIN series: domain specific large language models for natural science. arXiv. doi: 10.48550/arXiv.2308.13565

Xu, F. F., Alon, U., Neubig, G., and Hellendoorn, V. J. (2022). A systematic evaluation of large language models of code. arXiv. doi: 10.48550/arXiv.2202.13169

Xu, X., Lu, Y., Vogel-Heuser, B., and Wang, L. (2021). Industry 4.0 and industry 5.0—inception, conception and perception. J. Manuf. Syst. 61, 530–535. doi: 10.1016/j.jmsy.2021.10.006

Zhang, J., Chen, Y., Niu, N., and Liu, C. (2023a). A preliminary evaluation of ChatGPT in requirements information retrieval. arXiv. doi: 10.2139/ssrn.4450322

Zhang, W., Liu, H., Du, Y., Zhu, C., Song, Y., Zhu, H., et al. (2023b). Bridging the information gap between domain-specific model and general LLM for personalized recommendation. arXiv. doi: 10.48550/arXiv.2311.03778

Zhao, W. X., Zhou, K., Li, J., Tang, T., Wang, X., Hou, Y., et al. (2023). A Survey of Large Language Models. doi: 10.48550/arXiv.2303.18223

Zuccon, G., Koopman, B., and Shaik, R. (2023). "ChatGPT hallucinates when attributing answers," in Proceedings of the Annual International ACM SIGIR Conference on Research and Development in Information Retrieval in the Asia Pacific Region, SIGIR-AP '23 (New York, NY: Association for Computing Machinery), 46–51.